add PCI device interface and a vfio backend driver #487
Conversation
Force-pushed from db86eb1 to 7c06e0b
Thanks for the PR! I am seeing some failures on the x86_64 tests:
as well as compile errors on mingw:
It would be great if you could try running the tests in your environment to confirm whether they fail there as well. I'll follow up in the next few days with a more in-depth review.
Hi Octavian! Thanks for the comment!
Maybe change the title, since uio is no longer used? The reason for using vfio instead would be nice to have in the PR comments.
Force-pushed from 1e2fc33 to dcfd93e
left several comments.
arch/lkl/mm/bootmem.c (outdated):
	_memory_start = (unsigned long)lkl_ops->mem_alloc(mem_size);
	if (lkl_ops->pci_ops && lkl_ops->pci_ops->alloc) {
		_memory_start =
			(unsigned long)lkl_ops->pci_ops->alloc(mem_size);
When pci_ops is installed but not used, do we need to use pci_ops->alloc()?
Actually, pci_ops->alloc() just calls mmap() internally, so I updated the patch and introduced a new interface, lkl_ops->page_alloc(), whose implementation is equivalent to that of pci_ops->alloc().
Now the PCI-related functions are removed from the kernel memory allocation path.
https://github.com/lkl/linux/pull/487/files#diff-65cbb8e45dedadd081421ddc9002d12fR17
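For illustration, a host-side allocator of the kind described above might look like the sketch below; the function name host_page_alloc is an assumption, not the PR's actual symbol. The key property is that mmap() returns page-aligned anonymous memory, which plain malloc() does not guarantee.

	/* Hypothetical sketch of an mmap-based page allocation helper. */
	#include <stddef.h>
	#include <sys/mman.h>

	static void *host_page_alloc(unsigned long size)
	{
		/* Anonymous private mapping: page-aligned, zero-filled memory. */
		void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		return (mem == MAP_FAILED) ? NULL : mem;
	}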
LGTM, asking @tavip for more comments, if any.
Could you please elaborate on why page_alloc is needed? The only user I see is in bootmem_init, and there we already make sure that we skip host memory until it is page aligned.
Good point! In this scenario (with the vfio backend), we can just use memory allocated by malloc().
Meanwhile, in other cases, I believe LKL should provide an alternative memory allocation interface, other than host_ops->alloc(), for PCI DMA. For example, an architecture without an IOMMU requires physically contiguous pages for bootmem, which is impossible to allocate with malloc(). Of course, developers can override host_ops->alloc() for this, but that turns out to be memory-inefficient because host_ops->alloc() would always have to return whole pages. This is why I proposed pci_ops->alloc() in the first commit.
However, your comment made me realize that the name "page_alloc" easily leads to misunderstanding. What we need here is not page allocation, but allocating memory suited to the PCI backend driver. I'm considering how to fix this and will apply the change in a couple of days.
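As a rough illustration of the interface being discussed, the backend-specific allocation hook could take a shape like the following; the struct and field names are assumptions made for the example, not necessarily what the PR defines.

	/* Hypothetical shape of a PCI-backend allocation hook. */
	struct lkl_pci_backend_ops_example {
		/* Return memory the PCI backend can expose to the device for
		 * DMA, e.g. physically contiguous, page-aligned memory on
		 * hosts that lack an IOMMU. */
		void *(*alloc)(unsigned long size);
		/* Release memory previously obtained from alloc(). */
		void (*free)(void *mem, unsigned long size);
	};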
Got it, thanks for the detailed explanation. pci_ops->alloc() sounds good (maybe later we can convert it into a host dma_alloc if there are other users).
One thing that is not clear to me is who will use this memory. I don't think we need to allocate bootmem with it; perhaps just internal PCI buffers?
I totally agree that allocating the whole LKL memory with dma_alloc is not a sophisticated approach.
However, Linux components assume that memory allocated by kmalloc() or the other memory allocators is DMA-capable.
Documentation/DMA-API-HOWTO.txt says
If you acquired your memory via the page allocator
(i.e. __get_free_page*()) or the generic memory allocators
(i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from
that memory using the addresses returned from those routines.
This makes things difficult, because it means that DMA-capable memory allocators are often not called at all. (If they were, we could just hook dma_alloc() and return DMA-capable memory.)
There are some ways to work around this, but none of them are workable with modern device drivers such as NIC and NVMe drivers.
The solutions I examined (and discarded) are:
- Using zones to restrict DMA to limited memory ranges. This is meant for devices with limited address spaces, but it does not work with the NVMe driver, which requires the full address space.
- Bounce buffers. PCI drivers for some ARM SoCs implement this. It does not work either, because the NVMe driver requires coherent memory and does not invoke DMA sync operations.
So, IMO, the only way to implement this PR is to make the whole bootmem DMA-capable, even though this seems ugly...
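To make the constraint concrete, below is the kind of driver pattern the DMA-API-HOWTO excerpt permits and that LKL cannot intercept: the buffer comes straight from kmalloc() and the driver simply maps it, so the only way to satisfy such drivers is for all kernel memory to already be DMA-capable. This is an illustrative example, not code from this PR.

	#include <linux/slab.h>
	#include <linux/dma-mapping.h>

	/* Typical driver pattern: DMA straight from kmalloc()'d memory. */
	static int example_dma_from_kmalloc(struct device *dev, size_t len)
	{
		void *buf = kmalloc(len, GFP_KERNEL);	/* ordinary kernel memory */
		dma_addr_t handle;

		if (!buf)
			return -ENOMEM;

		/* No dma_alloc_coherent() here that LKL could hook; the driver
		 * assumes this mapping just works. */
		handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
		if (dma_mapping_error(dev, handle)) {
			kfree(buf);
			return -EIO;
		}

		dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
		kfree(buf);
		return 0;
	}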
Force-pushed from 3c42fda to dd0f447
I created another PR to lkl/lkl-docker, which provides QEMU and a disk image for testing this feature. As soon as the PR to lkl/lkl-docker is merged, I will create a new PR (or add a commit to this PR) to replace the Docker image.
Force-pushed from fa70c05 to 862ef0d
LGTM. Could you also please rebase and remove the merge commit to keep the history clean? Thanks a lot for the hard work, Shinichi! With regard to the page_alloc/DMA stuff, I think it is something we can look into later. A combination of a DMA zone and dma_ops should work, but I don't know enough about how the NVMe driver operates to formulate a concrete suggestion. My rationale for avoiding making bootmem DMA-capable is to avoid confusion in the host ops, since in some environments / LKL applications no DMA is required.
LKL is now able to manage PCI devices by itself, like DPDK!

This commit implements a PCI bus driver and a PCI device interface. The commit also includes a vfio-pci backend driver as a reference implementation, so no extra kernel module is needed if the PCI device is assigned to VFIO.

I believe the vfio backend driver fulfills the needs of most cases, but users can inject wrapper functions for another userspace PCI framework such as uio-pci-generic or a handmade kernel module. In either case, the framework should provide physically contiguous memory for DMA, because the kernel and some drivers (e.g. NVMe) assume their memory is physically contiguous.

Signed-off-by: Shinichi Awamoto <shinichi.awamoto@gmail.com>
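As a hedged illustration of the "wrapper functions for another userspace PCI framework" idea, a non-vfio backend such as uio-pci-generic typically reaches PCI config space through sysfs rather than through the VFIO ioctl interface. The helper below only shows that access pattern; the function name, the example sysfs path, and how such a wrapper would plug into the PR's interface are assumptions.

	#include <fcntl.h>
	#include <stddef.h>
	#include <sys/types.h>
	#include <unistd.h>

	/* Read PCI config space via sysfs, e.g. from a path such as
	 * "/sys/bus/pci/devices/0000:00:04.0/config" (example only). */
	static int sysfs_pci_conf_read(const char *config_path, unsigned int offset,
				       void *buf, size_t len)
	{
		int fd = open(config_path, O_RDONLY);
		ssize_t ret;

		if (fd < 0)
			return -1;

		ret = pread(fd, buf, len, offset);
		close(fd);

		return (ret == (ssize_t)len) ? 0 : -1;
	}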
Thanks for all your support. Commits are rebased and merged. Also, .circleci/config.yaml refers to … I agree that we need further discussion for better memory management. Ideally, we could fix the NVMe driver to allocate its memory from the DMA zone; this would enable us to separate bootmem and DMA memory completely.